\section{Conclusion}\label{sec:conclusion}

We introduce GLUE, 
a platform and collection of resources for evaluating and analyzing natural language understanding systems. 
We find that, in aggregate, models trained jointly on our tasks outperform the combination of models trained separately for each task.
We confirm the utility of attention mechanisms and transfer learning methods such as ELMo in NLU systems; together they outperform the best sentence representation models on the GLUE benchmark, but still leave substantial room for improvement.
When evaluated on our diagnostic dataset, these models fail (often spectacularly) on many linguistic phenomena, suggesting possible avenues for future work.
In sum, the question of how to design general-purpose NLU models remains unanswered, and we believe that GLUE can provide fertile soil for addressing this challenge.